This markdown document lists the steps we followed to refine the models using climate data only.
The data used in the models below are described in the Data Wrangle folder.
Note that the analysis below uses the iNat data with 1510 observations. Amazing!
Below we compare models using canopy symptoms as the response variable.
There are multiple ways to group the response variable, depending on the desired resolution (fineness) of the model.
For now, we move forward with the binary response grouping because it is the broadest and presumably the easiest for the model to classify.
All tree health categories
## # A tibble: 12 × 2
## # Groups: field.tree.canopy.symptoms [12]
## field.tree.canopy.symptoms n
## <fct> <int>
## 1 Branch Dieback or 'Flagging' 74
## 2 Browning Canopy 43
## 3 Candelabra top or very old spike top (old growth) 2
## 4 Extra Cone Crop 4
## 5 Healthy 813
## 6 Multiple Symptoms (please list in Notes) 31
## 7 New Dead Top (red or brown needles still attached) 48
## 8 Old Dead Top (needles already gone) 154
## 9 Other (please describe in Notes) 16
## 10 Thinning Canopy 225
## 11 Tree is dead 74
## 12 Yellowing Canopy 26
We also need to filter the data to only include response and explanatory variables we’re interested in. For example, whether a sound clip was included in the iNat data is not important.
We also need to remove the other response variables, like “field.percent.canopy.affected….”, so they are not used as predictors of tree health.
Note it might be interesting to know whether the user was an important factor in predicting whether a tree is healthy/unhealthy.
A number of factors should probably also be removed because they may be biasing the data. For example, the ‘other factor’ question may only be answered for unhealthy trees. We need to think about this a bit more.
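The column-filtering step above can be sketched with `dplyr`. All column names here except `field.tree.canopy.symptoms` (which appears in the output above) are hypothetical stand-ins for the real iNat fields:

```r
library(dplyr)

# Toy stand-in for the iNat data; sound.url and field.percent.canopy.affected
# are hypothetical names illustrating metadata and competing-response columns
inat <- data.frame(
  field.tree.canopy.symptoms    = c("Healthy", "Thinning Canopy"),
  field.percent.canopy.affected = c(0, 40),   # competing response variable
  sound.url                     = c(NA, NA),  # iNat metadata we don't need
  mean.annual.temp              = c(9.1, 10.3) # climate predictor we keep
)

# Drop metadata and competing response columns so only the chosen
# response and the climate predictors remain
filtered <- inat %>%
  select(-sound.url, -starts_with("field.percent.canopy"))

names(filtered)
```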
Remove variables that have near-zero standard deviations (i.e., the entire column is the same value).
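A minimal sketch of that step using `caret::nearZeroVar()`, on a toy data frame (the column names are illustrative, not the real ones):

```r
library(caret)

df <- data.frame(
  response = factor(c("Healthy", "Unhealthy", "Healthy", "Unhealthy")),
  constant = rep("same", 4),        # zero variance: entire column identical
  climate  = c(8.2, 9.5, 7.7, 10.1)
)

# nearZeroVar() returns the indices of (near-)zero-variance columns
drop.cols <- nearZeroVar(df)
cleaned <- df[, -drop.cols, drop = FALSE]

names(cleaned)
```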
Binary tree health categories
## # A tibble: 2 × 2
## # Groups: field.tree.canopy.symptoms [2]
## field.tree.canopy.symptoms n
## <fct> <int>
## 1 Healthy 813
## 2 Unhealthy 695
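One way the binary grouping above could be produced is to collapse every non-"Healthy" symptom category into "Unhealthy"; a sketch with base R on a few of the category labels:

```r
# Collapse the 12 canopy-symptom categories into Healthy vs Unhealthy:
# any recorded symptom other than "Healthy" counts as Unhealthy
symptoms <- factor(c("Healthy",
                     "Thinning Canopy",
                     "Old Dead Top (needles already gone)",
                     "Healthy"))

binary.response <- factor(ifelse(symptoms == "Healthy",
                                 "Healthy", "Unhealthy"))

table(binary.response)
```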
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 32.56%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 581 232 0.2853629
## Unhealthy 259 436 0.3726619
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 33.29%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 580 233 0.2865929
## Unhealthy 269 426 0.3870504
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = normal.monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 32.96%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 581 232 0.2853629
## Unhealthy 265 430 0.3812950
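The three fits above share the same call pattern; a runnable sketch on simulated stand-in data (the real `binary` data frame has many climate predictors, here replaced by one toy variable):

```r
library(randomForest)

set.seed(42)
# Toy stand-in for the binary data frame: one climate predictor plus response
binary <- data.frame(
  field.tree.canopy.symptoms = factor(rep(c("Healthy", "Unhealthy"), each = 50)),
  mean.annual.temp           = c(rnorm(50, mean = 9), rnorm(50, mean = 11))
)

rf <- randomForest(
  field.tree.canopy.symptoms ~ ., data = binary,
  ntree      = 2001,      # odd number of trees avoids tied votes
  importance = TRUE,      # store variable importance measures
  proximity  = TRUE,      # store case proximities (for MDS plots, etc.)
  na.action  = na.omit
)

rf$err.rate[rf$ntree, "OOB"]  # out-of-bag error estimate
```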
Remove variables that have near-zero standard deviations (i.e., the entire column is the same value).
##
## Call:
## randomForest(formula = field.number.of.additional.unhealthy.trees..of.same.species..in.area..within.sight. ~ ., data = combo.filtered.nearzerovar, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 21
##
## OOB estimate of error rate: 41.99%
## Confusion matrix:
## 0 1 2-3 4-6 7-10 More than 10 Not sure class.error
## 0 765 47 43 28 7 13 12 0.1639344
## 1 86 15 10 11 2 2 1 0.8818898
## 2-3 96 9 26 18 4 4 3 0.8375000
## 4-6 62 5 17 49 7 11 1 0.6776316
## 7-10 14 1 5 5 8 8 1 0.8095238
## More than 10 41 1 6 11 5 11 0 0.8533333
## Not sure 29 1 3 2 1 1 2 0.9487179
##
## Call:
## randomForest(formula = field.number.of.additional.unhealthy.trees..of.same.species..in.area..within.sight. ~ ., data = monthless.combo.filtered.nearzerovar, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 41.92%
## Confusion matrix:
## 0 1 2-3 4-6 7-10 More than 10 Not sure class.error
## 0 766 46 43 29 7 13 11 0.1628415
## 1 87 16 10 9 2 2 1 0.8740157
## 2-3 96 10 26 17 4 4 3 0.8375000
## 4-6 63 8 15 47 7 12 0 0.6907895
## 7-10 15 1 4 5 8 8 1 0.8095238
## More than 10 38 2 7 10 5 12 1 0.8400000
## Not sure 27 1 4 2 1 2 2 0.9487179
##
## Call:
## randomForest(formula = field.number.of.additional.unhealthy.trees..of.same.species..in.area..within.sight. ~ ., data = normal.monthless.combo.filtered.nearzerovar, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 41.99%
## Confusion matrix:
## 0 1 2-3 4-6 7-10 More than 10 Not sure class.error
## 0 768 48 40 30 6 13 10 0.1606557
## 1 88 16 10 9 2 1 1 0.8740157
## 2-3 96 8 26 19 4 4 3 0.8375000
## 4-6 62 8 16 47 6 13 0 0.6907895
## 7-10 13 2 5 4 8 9 1 0.8095238
## More than 10 38 2 6 14 5 9 1 0.8800000
## Not sure 28 1 3 1 1 3 2 0.9487179